Journal of Open Source Software

The Open Journal

Preprints posted in the last 30 days, ranked by how well they match Journal of Open Source Software's content profile, based on 22 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.

1
SaVanache: indexing and visualizing pangenome variation graphs

Mohamed, M.; Durant, E.; Rouard, M.; Muller, C.; Monat, C.; Conte, M.; Sabot, F.

2026-05-08 bioinformatics 10.64898/2026.05.05.722901 medRxiv
Top 0.1%
2.7%

With the rapid increase in genome sequencing and the growing availability of genomic resources, genomics is shifting toward pangenome representations that capture intra- and inter-specific diversity by integrating multiple genomes into a single entity. These pangenomes are increasingly modeled as graphs, encoding complex genomic variations in structures such as de Bruijn or variation graphs. However, while genome browsers provide standard and effective solutions for visualizing single or limited numbers of genomes, equivalent interactive tools for graph-based pangenomes remain limited, particularly for variation graph models. We developed SaVanache, a multi-resolution visualization interface designed to explore pangenome variation graphs at various depths. SaVanache enables the exploration of both global diversity and structural variations (SVs) across genomes relative to a user-defined linear pivot genome. Unlike synteny viewers, SaVanache emphasizes variations by representing SV types through a dedicated set of glyphs, facilitating intuitive one-to-many comparisons. To support smooth exploration, SaVanache preprocesses a Graphical Fragment Assembly (GFA) pangenome file into optimized index and data structures, enabling fast, real-time queries on large pangenome graphs. By combining advanced visualization techniques with efficient data handling, SaVanache provides a robust tool for scientists to analyze and visualize genetic variation within genomes and pangenomes, facilitating the identification of genetic determinants associated with phenotypes of interest and fully exploiting current genomic resources. Author summary: We introduce SaVanache, an innovative tool that transforms the way we explore genomic resources. SaVanache allows visualization and analysis of pangenome variation graphs (PVGs), which capture genomic diversity by integrating structural variants (SVs) and single nucleotide polymorphisms (SNPs) across multiple genomes.
Unlike traditional genome browsers limited to a few genomes, SaVanache offers a multi-level, user-friendly interface that allows users to explore from whole pangenomes down to individual structural variants, enabling multidimensional research and development. Using a linear pivot genome as a visual reference, SaVanache simplifies complex PVG structures into intuitive comparisons. It efficiently handles large datasets and speeds up data retrieval through internal parsing. The front-end, built with modern JavaScript frameworks, provides interactive and responsive visualization, while the Python/Django backend supports real-time data updates. Users can detect and classify SVs by comparing syntenic segments between genomes, visualized through a novel glyph-based system that uses shapes and colors to represent complex rearrangements. SaVanache supports seamless zooming from chromosome-wide to nucleotide-level views, interactive diversity scatterplots, dynamic pivot genome switching, and grouping genomes by metadata to explore genotype-phenotype links. In addition, export functions bridge visualization with downstream bioinformatics. Developed with user feedback, SaVanache balances biological relevance and computational efficiency, overcoming PVG complexity to empower users with unprecedented insight into genomic diversity and SVs.

2
Figra: A WebAssembly-based Excel Add-in for publication-quality scientific visualization with ggplot2

Sato, Y.

2026-05-12 bioinformatics 10.64898/2026.05.06.723320 medRxiv
Top 0.1%
1.9%

Data visualization is a critical step in scientific communication. Most researchers rely on subscription-based software for this purpose, which requires ongoing licensing costs. Free alternatives such as R and Python offer publication-quality output but demand programming expertise that many researchers do not possess. Artificial intelligence tools can assist with figure generation but remain frustrating when users wish to fine-tune specific visual parameters to their preference. Meanwhile, Microsoft Excel, the most widely used tool for scientific data storage and management, offers limited visualization capabilities, forcing researchers to transfer their data to external software as an extra step before creating figures. Here we present Figra, a free Excel Office Add-in that eliminates this extra step by enabling publication-quality ggplot2-based figure generation directly within Excel, with simple and direct control over every visual option. Figra leverages WebAssembly technology (webR) to execute R code entirely within the browser, requiring no R installation, no subscription, and no server connection. The add-in supports over 20 chart types spanning distribution plots, grouped comparisons, time-series, scatter plots, and specialized curve-fitting analyses. For applicable chart types, Figra performs automated or manual statistical analysis supporting both paired and unpaired designs across two or more groups. Additionally, Figra exports simplified, executable R code that reproduces the displayed figure, serving as an educational tool for researchers wishing to learn ggplot2. Figra is open-source and freely available at https://h20gg702.github.io/figra-pages/index.html while the source code is provided at https://github.com/h20gg702/Figra.

3
MicrobeMS - A MATLAB Toolbox for Microbial Identification Based on Mass Spectrometry

Lasch, P.

2026-05-12 bioinformatics 10.64898/2026.05.08.723807 medRxiv
Top 0.1%
1.6%

Over the last two decades, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-ToF MS) has become the standard method for identifying bacteria and has found a wide range of applications, especially in clinical microbiology. The method's high taxonomic resolution, minimal sample preparation, and complete, ready-to-use commercial systems, which include instrumentation, experimental protocols, spectral databases, and identification analysis software, were key factors in the success of MALDI-ToF MS as the standard for identifying microorganisms in routine diagnostic laboratories. However, despite the availability of these commercial solutions, there is also a growing need for efficient, cost-effective, vendor-neutral databases and analysis tools. These tools would enable the compilation of user-defined mass spectral databases and the testing of new analysis methods and algorithms, particularly in an academic context. To this end, MicrobeMS software has been developed to cover all stages of MALDI-ToF MS-based identification analysis. MicrobeMS is an easy-to-use desktop application for analyzing mass spectra from microorganisms and performing tasks related to spectrum database compilation. It includes routines for direct data import and export, biomarker peak searches, management of spectrum metadata, testing of spectrum quality, supervised and unsupervised identification analysis, and intuitive result display. MicrobeMS is implemented in MATLAB and is freely available as MATLAB pcode for Windows and Linux, as well as a standalone application.
Over the last fifteen years, the software has undergone continuous development and is now used routinely in various settings at the Centre for Biological Threats and Special Pathogens (ZBS) at the Robert Koch Institute (RKI) in Berlin, Germany, for example in supporting spectrum database compilation, to identify special or rare pathogenic bacteria by advanced identification analysis concepts, or to test in silico MALDI-ToF MS databases derived from microbial genomes. In this software publication the versatility and capabilities of MicrobeMS are demonstrated using a test data set from highly pathogenic bacteria (HPB) which has been obtained as part of a published European Union (EU)-funded External Quality Assurance Exercise (EQAE). MicrobeMS and HPB test data can both be downloaded from https://wiki.microbe-ms.com/. The goal of this software publication is twofold: to raise awareness of MicrobeMS within the scientific community and to encourage the testing of the software and custom-developed MALDI-ToF MS databases of the RKI, which are published at the ZENODO data repository (https://doi.org/10.5281/zenodo.7702374).

4
VX: an AI-enabled desktop genome viewer and transcriptome browser with a programmable analysis framework

Shirokikh, N. E.; Cleynen, A.

2026-05-20 bioinformatics 10.64898/2026.05.17.725790 medRxiv
Top 0.1%
1.5%

Background: Genome and transcriptome browsers are central to the interpretation of high-throughput sequencing data, but today's tools assume a human operator at a graphical interface and offer only limited programmability. As large-language-model assistants become routine in bioinformatics [Anthropic, 2024], this creates a bottleneck: agents cannot observe the visual state of the browser or drive it through the same interface as the human user, and analyses remain fragmented across a separate ecosystem of external tools. Transcript-coordinate data, produced by ribosome profiling [Ingolia et al., 2012] and direct RNA sequencing [Garalde et al., 2018], is also awkwardly supported in chromosome-oriented viewers. Results: We present VX, a desktop genome and transcriptome viewer written in D, using GTK 3 and OpenGL, that handles genome-scale and transcriptome-scale data in a unified interface. VX exposes its full functionality through an embedded HTTP API on the loopback interface and a Model Context Protocol server of currently thirty-nine tools, so that scripts and LLM agents can load data, navigate, manage tracks, run analyses, and capture figures through the same contract used by the GUI. An integrated analysis framework provides more than fifty analyses and includes signal processing and peak calling, quantification, variant analysis, alignment statistics, interaction and cross-track comparisons, all with an explicit four-level scope hierarchy running from viewport to whole dataset; results are written to disk and, where appropriate, added as new tracks. Additional features include a magnifier popup for base-resolution inspection (Alt+hover), chromosome-alias resolution across UCSC, Ensembl, and NCBI conventions, viewport video recording via an ffmpeg pipe, and INI-based configuration. Conclusions: VX complements existing desktop and web browsers by providing a native agent-control layer, an integrated analysis framework, and first-class transcript-space handling.
The binary is freely available for non-commercial use; the HTTP API and MCP protocol are fully specified in this article, so third-party clients can be written independently of the core implementation.

5
DigitalPedon: A Novel Digital Twin Framework for Soil Profile Monitoring and Global Soil Data Interoperability

Youssef, A.; Badreldin, N.

2026-05-08 bioengineering 10.64898/2026.05.05.722891 medRxiv
Top 0.1%
1.3%

The Digital Pedon (DP) is an open-source Python framework that represents a soil profile as a continuously updated digital twin, bridging three persistent gaps in soil science: disconnected models and observations, cross-database interoperability, and the inference gap between raw sensor signals and agronomically meaningful variables. Integrating real-time sensor streams, model-based solver chains (Model-Zoo), GLOSIS-compliant ontology mapping, and a novel LLM agentic interface layer enabling natural language soil queries, the DP supports applications spanning precision agriculture, digital soil mapping, and environmental sustainability assessment. Four proof-of-concept experiments confirm automatic profile initialisation fidelity, solver chain consistency, ontology compliance, and user-defined solver extensibility.

6
LIVIA: a browser-based tool for assessing and visualizing predicted protein interactions

Kim, A.-R.; Perrimon, N.

2026-05-10 bioinformatics 10.64898/2026.05.01.721633 medRxiv
Top 0.2%
0.9%

As protein structure prediction tools become widely adopted across biology, there is a growing need for accessible methods to assess and visualize predicted protein-protein interactions (PPIs). Here we present LIVIA (Local Interaction Visualization and Analysis), a browser-based tool that computes local PPI confidence metrics across multiple prediction platforms, identifies predicted interface residues, embeds an interactive Mol* 3D viewer, and generates visualization scripts for ChimeraX and PyMOL. The tool automatically detects prediction formats; all parsing and computation occur locally on the user's machine. LIVIA is freely available at https://flyark.github.io/LIVIA.

7
RAPID: an interactive R/Shiny platform for end-to-end 16S rRNA and ITS amplicon sequence analysis using DADA2

Kapoor, B.; Cregger, M. A.; Ranjan, P.

2026-05-08 bioinformatics 10.64898/2026.05.05.723040 medRxiv
Top 0.2%
0.9%

Motivation: Amplicon sequencing of 16S rRNA and internal transcribed spacer (ITS) gene regions is the most widely used approach for characterizing bacterial and fungal communities, respectively. The DADA2 pipeline has become a standard for inferring amplicon sequence variants (ASVs), offering single-nucleotide resolution over traditional OTU clustering. However, executing the full DADA2 workflow requires proficiency in R programming and manual coordination of multiple sequential steps, presenting a substantial barrier for researchers in clinical, environmental, and agricultural sciences who lack computational training. Results: We present RAPID (R-based Amplicon Pipeline for Interactive DADA2), a pair of R/Shiny applications providing complete graphical user interfaces for 16S rRNA and ITS amplicon sequence analysis. The 16S application implements a 10-step guided workflow from raw paired-end FASTQ files through quality filtering, error learning, dereplication, paired-read merging, chimera removal, taxonomy assignment (SILVA), phyloseq construction with data transformation (rarefaction, relative abundance, or CLR), interactive visualization (rarefaction curves, alpha diversity, NMDS, PCoA, taxonomic abundance), PERMANOVA, and ANCOM-BC2 differential abundance analysis. The ITS application extends this to an 11-step workflow, adding an automated primer removal step using cutadapt with support for multiple primers and length-variable amplicons, and uses the UNITE database for fungal taxonomy. Both applications feature asynchronous background processing, session persistence, real-time progress monitoring, publication-ready figure export, and comprehensive result downloads. Availability: RAPID is freely available at https://github.com/beantkapoor786/RAPID. Both applications can be installed locally on any system with R (version 4.0 or higher) and run as local web applications accessible through a standard browser.

8
cran2crux: automatically create CRUX ports for R-packages

Petrov, P.; Izzi, V.

2026-05-13 bioinformatics 10.64898/2026.05.09.723963 medRxiv
Top 0.2%
0.8%

Motivation: R together with CRAN and Bioconductor provides one of the richest ecosystems for bioinformatics and computational biology, with thousands of specialized packages. While GNU/Linux is a widely used operating system in this field, R packages are typically managed independently of the system's native package manager. This separation makes installation, updates and mass rebuilds cumbersome. CRUX, a minimalist semi-source GNU/Linux distribution, offers great flexibility with its ports-based system for the seamless integration of R packages with its native package manager. Results: The cran2crux tool presented here automatically generates CRUX ports for packages from both CRAN and Bioconductor. It performs recursive dependency resolution, handles naming conventions, extracts dependency information, and supports inclusion of optional dependencies. The tool also provides convenient functions for checking updates and regenerating outdated ports. It can generate over 140 ports for complex packages such as Seurat in approximately 11 seconds, dramatically simplifying the maintenance of large R-dedicated repositories on CRUX. Availability: cran2crux is available under the MIT license at https://github.com/izzilab/cran2crux. As of now, more than 650 R package ports, generated with the tool, are available in the CRUX ports database.

9
BAT: an integrated pipeline for gene tree construction, annotation, and functional inference

Sheppard, B. D.; Behnken, B.; Steinbrenner, A.

2026-05-12 bioinformatics 10.64898/2026.05.07.721474 medRxiv
Top 0.2%
0.8%

Gene family functional exploration often requires analyzing motifs, domains, and associated datasets (e.g. gene expression) in the phylogenetic context of a gene tree. As genomic resources become more abundant, local pipelines are needed to analyze gene families of interest with project-specific resources. Here we present BLAST-Align-Tree (BAT), a bioinformatic pipeline for automated gene family phylogeny construction and annotation to enable gene tree exploration. BAT combines a BLAST search of local genome databases with a robust and flexible gene tree construction pipeline that enables multiple modes of annotation. Output visualizations display experimental datasets, custom regex specified amino acid motifs, and protein HMM domain annotations. For flexibility, BAT runs locally and is independent of pre-existing databases, allowing the easy incorporation of custom genomes and datasets. Three primary case studies described here demonstrate the utility of BAT for inferring the function of homologs and orthologs within characterized gene families. BAT is suitable for fine scale phylogenomic analysis of gene families across the tree of life, and default genomes available on installation span model eukaryotes.
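
The regex-specified amino acid motif annotation mentioned in the BAT abstract can be pictured with a small sketch. This is not BAT's code: the function name `find_motifs` is hypothetical, and the pattern used is the familiar N-glycosylation sequon (N, any residue except P, S or T, any residue except P), chosen only as an example. A lookahead is used so that overlapping occurrences are also reported.

```python
import re

def find_motifs(sequence, pattern):
    """Return (start, matched_text) for every occurrence of a regex motif.

    Hypothetical helper for illustration; a zero-width lookahead lets
    overlapping motif occurrences be found as well.
    """
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]

# Example pattern: N-glycosylation sequon N-{P}-[ST]-{P} written as a regex.
N_GLYC = r"N[^P][ST][^P]"

hits = find_motifs("MNGTAANPSA", N_GLYC)
print(hits)
```

Positions are zero-based offsets into the protein sequence, which would then have to be mapped onto alignment columns before being drawn next to a gene tree.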

10
Efficient and Tidy Manipulation of Annotated Matrix Data with plyxp

Landis, J. T.; Love, M. I.

2026-05-11 bioinformatics 10.64898/2026.05.06.721669 medRxiv
Top 0.2%
0.7%

Manipulating high-dimensional omics data, such as bulk or single cell gene expression counts matrices, typically requires a bioinformatics analyst to learn domain-specific functions and syntax. These matrix-centric functions and syntax can be less intuitive than working with tidy data analytic principles, as exemplified by tools such as dplyr applied to tabular data. We propose an expressive grammar for manipulating annotated matrix data, with syntax to access, modify, and append matrix data and tabular row and column metadata, including row-wise or column-wise grouped operations. This grammar defines multiple contexts and provides pronouns for specific recall and assignment within and across these contexts. The plyxp package is an implementation of this grammar for the R/Bioconductor ecosystem, with efficient abstractions for the SummarizedExperiment class. We demonstrate plyxp's efficiency compared to alternative approaches on data manipulation tasks requiring computation across contexts.

11
kinference: Pairwise kinship detection for Close-Kin Mark-Recapture

Bravington, M. V.; Baylis, S. M.; Eveson, P.; Feutry, P.

2026-05-21 genetics 10.64898/2026.05.18.725841 medRxiv
Top 0.3%
0.7%

Abstract: Close-Kin Mark-Recapture (CKMR) is a statistical framework for estimating demographic parameters of wild populations. Instead of recapturing individuals, it relies on the identification of closely-related pairs such as parents and offspring, or siblings. By measuring how often such close-kin are "recaptured" among sampled animals (whether alive or dead), scientists can estimate demographic parameters such as census size, mortality rates, and connectivity. CKMR is starting to change fisheries and wildlife management by giving more reliable demographic information, even for many species that resist conventional approaches. Here we introduce the kinference R package, which provides a set of tools for finding close-kin pairs among thousands of samples each genotyped at thousands of SNPs, and for associated quality control. The CKMR context implies different requirements and assumptions to many other kinship programs. In particular, kinference accounts empirically for linkage without requiring a genome assembly, is able to estimate and control false-negative and false-positive probabilities, and can cope with null alleles. The package has been developed and used in numerous CKMR projects since 2017. This paper documents the assumptions, statistical algorithms, and intended workflow for kinference.

12
metaJAM: a Nextflow integrated metagenomic workflow for sedimentary ancient DNA

Johnson, E.; Jin, C.; Guinet, B.; Alumbaugh, J.; Martin, N. L.

2026-05-07 bioinformatics 10.64898/2026.05.05.722689 medRxiv
Top 0.3%
0.6%

The application of metagenomics in ancient DNA (aDNA) research is rapidly expanding, driven in particular by advances in sedimentary aDNA research and sequencing technologies. Although many ancient DNA studies rely on broadly similar bioinformatic strategies, there is still no single standardized, widely adopted workflow. These differences can directly affect how efficiently past biodiversity can be reconstructed and authenticated from the various archives analyzed using ancient metagenomic approaches. Although a few pipelines tackle the processing of ancient DNA data from shotgun sequencing, the ones applied to metagenomic datasets are scarce and often resource-intensive or challenging to install, update, or extend with new tools and parameters. metaJAM, a scalable and user-friendly pipeline, is presented here to specifically address the challenges of metagenomic aDNA analyses of eukaryotes. The pipeline has been designed in Nextflow to ensure continuous development and can be used on different high-performance computing (HPC) clusters. metaJAM integrates all key steps required for ancient DNA metagenomic analyses, from raw sequencing data pre-processing to microbial filtering, taxonomic assignment via competitive iterative mapping against Bowtie 2 reference indexes and reassignment using lowest common ancestor (LCA) inference. Validation and authentication are performed using the post-LCA toolkit bamdam together with alignment to an exhaustive reference database using MMseqs2. It allows users to choose among alternative tools and generates a series of plots to support data visualization and taxon authentication. 
metaJAM differs from existing pipelines through its implementation of rigorous filtering of microbial-like reads by Kraken 2 classification and masking microbial-like regions, iterative or parallel Bowtie 2 mapping, validation of the detected taxa and integration of up-to-date tools for ancient metagenomic analysis, along with diagnostic plots that help users assess the reliability of taxonomic assignments and visualize their data. It copes well with limited computational resources, supports customised databases for taxonomic groups, and provides an accessible workflow to support the investigation of metagenomic ancient DNA datasets. Its applications span a range of contexts, from ecosystem reconstructions in environmental aDNA archives such as sediments, to metagenomic studies on archaeological artefacts and even taxonomic identification of undiagnosed biological materials.
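
The lowest common ancestor (LCA) reassignment named in the metaJAM abstract can be illustrated on a toy taxonomy. The sketch below is not metaJAM's implementation: the `PARENT` map, the taxa in it, and both function names are hypothetical, and real pipelines work on full reference taxonomies with alignment-score filters that are omitted here. It shows only the core idea: a read that maps to several taxa is assigned to the deepest node their lineages share.

```python
# Toy child -> parent map; a hypothetical taxonomy for illustration only.
PARENT = {
    "Bos taurus": "Bos",
    "Bos grunniens": "Bos",
    "Bos": "Bovidae",
    "Ovis aries": "Ovis",
    "Ovis": "Bovidae",
    "Bovidae": "Mammalia",
    "Mammalia": "root",
}

def lineage(taxon):
    """Return the path from a taxon up to the top of the tree, taxon first."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lowest_common_ancestor(taxa):
    """Return the deepest node shared by the lineages of all given taxa.

    Taxa absent from PARENT share nothing with the others, so the
    result falls back to "root" in that case.
    """
    if not taxa:
        raise ValueError("at least one taxon is required")
    lineages = [lineage(t) for t in taxa]
    shared = set(lineages[0]).intersection(*lineages[1:])
    # Walk the first lineage upward: the first shared node is the deepest one.
    for node in lineages[0]:
        if node in shared:
            return node
    return "root"

print(lowest_common_ancestor(["Bos taurus", "Bos grunniens"]))
print(lowest_common_ancestor(["Bos taurus", "Ovis aries"]))
```

A read hitting two species of the same genus is therefore reported at genus level, while a read hitting two genera is pushed up to the family, which is why LCA assignments become more conservative as hits become more ambiguous.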

13
Benchmarking long-read simulators against Oxford Nanopore whole-genome sequencing data

Taouk, M. L.; Ingle, D. J.; Wick, R. R.

2026-05-11 bioinformatics 10.64898/2026.05.06.723380 medRxiv
Top 0.3%
0.6%

Background: Oxford Nanopore Technologies (ONT) sequencing is increasingly used for whole-genome sequencing (WGS) across a wide range of applications. However, the platform has evolved rapidly through updates to flow cell chemistry and basecalling algorithms, altering the characteristics of the resulting sequencing data. Read simulators provide synthetic datasets with known ground truth, enabling controlled development and evaluation of methods. However, many existing simulators were developed for earlier versions of ONT sequencing or use generic long-read assumptions, and their realism for contemporary ONT data is unclear. Results: We benchmarked six ONT-compatible read simulators (Badread, LongISLND, lrsim, NanoSim, PBSIM3 and SimLoRD) using a microbial genome reference and ONT R10.4.1 reads as the empirical standard. Each tool was configured to maximise realism, including training on empirical reads when supported. We compared simulated and real datasets with respect to read length, read accuracy, FASTQ quality scores and sequence error profiles. No simulator reproduced all metrics of the real data well. PBSIM3 most closely reproduced read length, read accuracy and FASTQ quality scores, making it a strong simulator for broad read-level realism. However, it did not capture important features of the real error profile, including context-dependent substitution rates and homopolymer-length errors. Badread and LongISLND better reproduced some aspects of the error profile, but showed other departures from the real data. Conclusion: PBSIM3 is a good general-purpose choice for many ONT WGS simulation tasks because it reproduced several key read-level properties well. However, Badread or LongISLND may be preferable for applications where error structure is more important. No evaluated tool was realistic across all tested metrics, highlighting a gap for improved long-read simulators.

14
Digital Atlases to Unlock the Potential of Brain Biorepository Tissues for Interdisciplinary Research

Webster, J. M.; Shojaie, A.; Shen, Y. A.; Le, T.; Ragaglia, E.; Bogdani, M.; Kirkland, A.; Mac Donald, C.; Latimer, C. S.; Keene, C. D.; Grabowski, T. J.

2026-05-15 neuroscience 10.64898/2026.05.13.724753 medRxiv
Top 0.3%
0.5%

Human brain tissue preserved in biorepositories is foundational for the structural, cellular, and biomolecular research necessary for a mechanistic understanding of neurological diseases. Realizing the research potential of these valuable resources requires well-characterized research-relevant tissue that can be efficiently identified by investigators and incorporated into the conceptual and computational frameworks of interdisciplinary research. Several large-scale efforts to improve research reliability and reproducibility have sought to characterize and annotate the processes by which these samples are collected, yet limited progress has been made on standardizing spatial information for these samples. Biorepositories systematically collect brain tissue according to a brain sampling protocol (BSP) that differs between institutions, yet explicit spatial information regarding the samples may not be documented in standard operating procedures (SOPs). The anatomical location details available to investigators are inconsistent across biorepositories and typically lack sufficient anatomical precision to ensure correspondence with samples from other biorepositories or research-relevant brain regions specified by neuroimaging, functional, or disease-susceptibility criteria. Here, we introduce a pipeline for developing a Spatial Atlas for Mapping Protocol Locations of Ex vivo Samples (SAMPLES), which uses a neuroimaging framework to create a 3D representation of a BSP through a metrically precise digital instantiation of the procedures for brain extraction, segmentation, slicing, and sampling on a modern digital brain template. SAMPLES incorporates modern neuroinformatics conventions to create explicit 3D labels of BSP-defined samples that can be interactively visualized with freely available neuroimaging software.
We illustrate the pipeline by developing an atlas for the protocol from the University of Washington BioRepository and Integrated Neuropathology laboratory (UW BRaIN SAMPLES). By providing an explicit, computable reference, SAMPLES atlases can support the efficient identification, referencing, and utilization of postmortem samples for interdisciplinary research. These capabilities enable biorepository workflows, data harmonization across biorepositories, and integration with antemortem neuroimaging.

15
ANYI: The ANnotated Yeast Interactome

Nissley, D. A.; Goel, M.; Castellanos-Girouard, X.; Kuntz, C. P.; Wang, Y.; Mukhtar, S.; Serohijos, A.; Schlebach, J. P.

2026-05-05 bioinformatics 10.64898/2026.04.30.721908 medRxiv
Top 0.4%
0.5%

Although several existing protein-protein interaction (PPI) databases provide yeast PPI data, none unify large-scale network topology information with detailed biophysical, proteostasis, and regulatory annotations in a single protein-centric framework. To address this gap, we developed the ANnotated Yeast Interactome (ANYI), an open, integrated resource that combines experimental yeast PPIs with sixteen feature annotation types, including protein abundance, half-life, disorder content, post-translational modifications, conformational stability, chaperone interactions, sequence, and structure. ANYI integrates 3,927 proteins with 155 annotation features, forming a unified matrix that enables systematic cross-layer analyses. Available via GitHub and Docker Hub with an interactive network browser for broad accessibility, ANYI provides both experienced and beginner computational scientists with tools to investigate the yeast interactome. For example, users can directly test whether highly connected hub proteins exhibit distinct stability, disorder, or proteostasis signatures relative to peripheral nodes. Availability and implementation: The code used to assemble ANYI is available on GitHub at https://github.com/NCEMS/energetic-origins-of-PPI-connectivity and the database itself and interactive browser tool are available on Docker Hub as dannissleypsu/anyi-browser:v1.0.2.

16
pyKinaXe: a fast and robust turnkey kinase activity profiler with high resolution

Wuttke, D.; Hildt, E.; Kolesnichenko, P. V.

2026-05-15 bioinformatics 10.64898/2026.05.12.724658 medRxiv
Top 0.4%
0.5%

Motivation: Peptide microarray technologies such as PamGene's enable direct measurement of peptide phosphorylation by upstream kinases, yet extraction of kinases from raw data depends on proprietary software or separate open-source alternatives delivering time-consuming processing across a variety of different steps, limiting throughput for experimental large-scale kinome generation in clinical and research settings. Results: We developed pyKinaXe, a Python package for automated end-to-end analysis of PamChip® data, integrating robust image processing, quantification of phosphorylation kinetics, multi-database substrate-kinase mapping, and upstream kinase analysis into a single one-click pipeline. Validation on a selected published benchmark dataset recovered 76-89% of the signaling pathways for previously reported significantly deregulated kinases. Processing time was reduced on the same data from over 30 minutes to ~25 seconds, leading to a 75-fold speed increase compared to other open-source alternatives. Thus, pyKinaXe addresses the key limitations of existing peptide-microarray-based kinase activity inference tools (slow inference, fragmented workflows, and poor usability), enabling fast and robust analysis, and facilitating high-throughput experiments and large-scale kinome profiling. Availability and implementation: pyKinaXe is implemented in Python 3.13 and distributed under the Apache 2.0 License. Source code, documentation, and installation instructions are freely available at https://github.com/pykinaxe/pyKinaXe. The benchmark data is available at Mendeley Data (doi: 10.17632/ynp7f92n47.1). pyKinaXe's user-friendly web-based interface can be accessed at https://pykinaxe.github.io/home.

17
BatchVaria: a variance-aware framework for evaluating batch correction in high-dimensional omics data

Moir, N.; Sherwood, K.; Simpson, I.

2026-05-12 bioinformatics 10.64898/2026.05.07.721996 medRxiv
Top 0.4%
0.5%

Summary: Batch effects and other unwanted technical sources of variation remain a persistent challenge in the integrative analysis of high-dimensional omics data. Although established methods such as ComBat effectively mitigate batch-associated signal, their impact on biologically meaningful variation is frequently evaluated in an ad hoc and non-quantitative manner. This is particularly problematic in heterogeneous disease contexts, such as breast cancer transcriptomics, where technical and biological sources of variation may be partially confounded. We present BatchVaria, an R package that implements a variance-aware framework for batch correction and post-adjustment evaluation. BatchVaria integrates variance component modelling, batch adjustment, and systematic re-profiling within a unified analysis container, enabling iterative quantification and reassessment of technical and biological variance contributions while preserving analytical provenance. By supporting multiple variance profiling engines and structured storage of intermediate results, BatchVaria facilitates transparent and reproducible evaluation of batch correction strategies. We demonstrate the utility of BatchVaria using a publicly available breast cancer transcriptomic dataset with known covariate-driven structure, illustrating how iterative variance profiling can guide responsible batch correction without erosion of subtype-associated biological signal.

18
Machine learning-based prediction of memory requirements for metagenomic assembly in high-performance computing environments

Zierep, P. F.; Faack, S.; Beracochea, M.; Sanchez, S.; Batut, B.; Finn, R. D.; Gruening, B. A.

2026-05-13 microbiology 10.64898/2026.05.12.724571 medRxiv
Top 0.4%
0.5%

Metagenomic assembly can be a computationally intensive step in microbiome analysis, with memory requirements that vary widely depending on input data characteristics. In workflow systems like Galaxy and large-scale platforms like MGnify, which run thousands of heterogeneous jobs, inaccurate memory allocation drives job failures and costly retries when underestimated, and reduces throughput when overestimated. Current approaches rely primarily on heuristic rules based on input file size or sample metadata, which often fail to generalize across diverse datasets. In this study, we present a machine learning-based framework for predicting memory requirements of metagenomic assembly using metaSPAdes. We analyzed 300 assembly jobs from diverse biomes and evaluated 18 predictive models using combinations of input file size, biome classification, and sequence-derived k-mer features. K-mer profiles were computed from raw sequencing data and summarized into statistical descriptors capturing sequence complexity and diversity. Model performance was assessed using both conventional regression metrics and a production-oriented cost function that accounts for retry policies and resource waste in high-performance computing environments. Our results show that machine learning models can outperform commonly used heuristics. In particular, models incorporating biome information achieved the best performance and can be tuned to favor conservative predictions that reduce job failure rates. Simpler models based solely on input file size also performed competitively, offering a practical alternative for systems with limited feature availability. When evaluated under realistic workload distributions, predictive approaches reduced total memory waste by several million gigabyte-hours per 1,000 jobs compared to static allocation strategies. These findings demonstrate that data-driven resource prediction can substantially improve efficiency in metagenomic workflows. 
The proposed framework is adaptable to different computational environments and provides a foundation for integrating predictive resource allocation into large-scale bioinformatics platforms beyond Galaxy.
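The production-oriented cost function mentioned in this abstract can be pictured with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' definition: the function name `allocation_cost`, the rule that a failed attempt wastes its entire allocation for the full runtime, and the retry policy that doubles the allocation after each failure.

```python
def allocation_cost(predicted_gb, actual_gb, runtime_h,
                    retry_factor=2.0, max_retries=3):
    """Wasted memory in GB-hours for one job under a hypothetical retry policy.

    Assumptions (not taken from the preprint):
    - a job fails if its allocation is below the memory it actually needs;
    - a failed attempt wastes its whole allocation for the full runtime;
    - each retry multiplies the allocation by `retry_factor`.
    """
    wasted = 0.0
    alloc = predicted_gb
    for _ in range(max_retries + 1):
        if alloc >= actual_gb:
            # Successful run: only the unused headroom counts as waste.
            wasted += (alloc - actual_gb) * runtime_h
            return wasted
        # Failed run: the entire allocation was spent for nothing.
        wasted += alloc * runtime_h
        alloc *= retry_factor
    # Retries exhausted: return the waste accumulated by the failed attempts.
    return wasted


# An over-estimate wastes headroom; an under-estimate pays for failed attempts.
print(allocation_cost(10, 8, 2))  # 2 GB unused for 2 h
print(allocation_cost(3, 8, 1))   # two failed attempts, then success at 12 GB
```

Under a cost of this shape, under-prediction is penalized more heavily than mild over-prediction, which is one way to see why a model tuned toward conservative predictions can lower the overall cost.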

19
SNPWay: streamlined SNP-to-function and pathway over-representation analysis

Queme, B.; Kakkar, A.; Muruganujan, A.; Thomas, P. D.; Gauderman, W. J.; Mi, H.

2026-05-06 bioinformatics 10.64898/2026.05.03.722523 medRxiv
Top 0.4%
0.5%

Motivation: Post-GWAS interpretation frequently requires translating variant lists (e.g., lead SNPs, clumped loci, credible sets, or curated panels) into pathway and functional hypotheses. In practice, obtaining pathway and functional over-representation results from SNP inputs often requires stitching together multiple tools for variant annotation, regulatory annotation, gene identifier handling, and statistical testing. This integration burden can reduce reproducibility and restrict end-to-end analysis to groups with dedicated bioinformatics support. Summary: We present SNPWay, a web server and R package that performs end-to-end SNP-to-function and pathway over-representation analysis in a single standardized workflow. SNPWay accepts rsIDs, VCF files, or hg19/GRCh37 genomic coordinates. It queries Annotation Query (AnnoQ) to obtain SNP-to-gene mappings from ANNOVAR, SnpEff, and VEP under both Ensembl and RefSeq gene models, and incorporates enhancer-gene links via PEREGRINE to augment mappings for noncoding variants. SNPWay aggregates mapped genes into a single, non-redundant, combined gene list and submits it to PANTHER for over-representation testing against the Homo sapiens reference list, returning over-represented pathways and functional categories (e.g., Gene Ontology) with direct links for interactive exploration in PANTHER. SNPWay's modular architecture is designed for extensibility, enabling incorporation of additional analysis methods in future releases. A step-by-step walkthrough is provided in Supplementary Data. Availability and implementation: Web server: https://snpway.annoq.org/. R package and source code: https://github.com/USCbiostats/Annoq_Overrepr_Workflow. Documentation: https://snpway.annoq.org/about. Examples: The website contains sample files for the input formats, also provided in Supplementary Data. SNPWay is free to use with no mandatory login.
Contact: huaiyumi@usc.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
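For readers unfamiliar with over-representation testing, the following is a minimal sketch of one common formulation, the one-sided hypergeometric test. It is not taken from SNPWay or PANTHER, whose exact statistical procedure is not described in this abstract; the function name `hypergeom_sf` and the example counts are hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the reference list
    K: reference genes annotated to the pathway
    n: genes in the submitted (SNP-derived) list
    k: submitted genes annotated to the pathway
    """
    upper = min(K, n)
    total = comb(N, n)
    # Sum the probability of observing k or more pathway genes by chance.
    # math.comb returns 0 when the lower argument exceeds the upper one.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 20,000 reference genes, 150 in the pathway,
# 200 genes mapped from the input SNPs, 8 of them in the pathway.
p_value = hypergeom_sf(8, 20000, 150, 200)
print(f"over-representation p-value: {p_value:.3g}")
```

In practice a p-value like this is computed for every pathway or functional category and then corrected for multiple testing.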

20
BGC-QUAST: a quality assessment tool for genome mining software

Kushnareva, A.; Tupikina, D.; Almessady, H.; McHardy, A.; Gurevich, A.

2026-05-07 bioinformatics 10.64898/2026.05.04.722653 medRxiv
Top 0.4%
0.5%

Summary: Biosynthetic gene clusters (BGCs) encode microbial natural products, many of which have important ecological and biomedical roles. Genome mining tools enable large-scale BGC prediction, but their outputs differ substantially, complicating comparison and interpretation. We present BGC-QUAST, a framework for evaluating and comparing BGC predictions across three analysis modes: comparison across samples, assessment of BGC recovery in draft assemblies relative to reference genomes, and comparison of predictions from different tools using overlap analysis. BGC-QUAST provides standardized metrics, interactive visualizations, and integrated outputs for joint inspection of predictions, enabling the comprehensive comparison of genome mining results and facilitating sample prioritisation based on biosynthetic potential. Availability and implementation: BGC-QUAST is publicly available at https://github.com/gurevichlab/bgc-quast
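The idea of comparing predictions from different tools by overlap can be sketched as a simple interval comparison. This is an illustrative assumption, not BGC-QUAST's actual algorithm or API: the functions `overlap_len` and `match_clusters`, the `(start, end)` coordinate tuples, and the 50% threshold are all hypothetical.

```python
def overlap_len(a, b):
    """Length of the overlap between two (start, end) intervals, 0 if disjoint."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def match_clusters(tool_a, tool_b, min_frac=0.5):
    """Pair up predicted clusters from two tools on the same contig.

    A pair (i, j) is reported when the overlap covers at least `min_frac`
    of the shorter of the two clusters (a hypothetical criterion).
    """
    pairs = []
    for i, a in enumerate(tool_a):
        for j, b in enumerate(tool_b):
            shorter = min(a[1] - a[0], b[1] - b[0])
            if shorter > 0 and overlap_len(a, b) / shorter >= min_frac:
                pairs.append((i, j))
    return pairs


# Hypothetical coordinates of clusters predicted by two tools.
tool_a = [(0, 100), (500, 600)]
tool_b = [(50, 120), (900, 1000)]
pairs = match_clusters(tool_a, tool_b)
print(pairs)
```

Clusters left unpaired on either side would then count as tool-specific predictions, which is the kind of difference an overlap analysis is meant to surface.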